library(dplyr)
library(tidyr)
library(targets)
library(ggplot2)
# get pipsi
pips <- tar_read(constant_pips)
X_order <- unique(pips$X_name)[c(3,1,2)]
pips$X_name <- factor(pips$X_name, levels = X_order)

cs <- tar_read(constant_cs)
cs$X_name <- factor(cs$X_name, levels = X_order) 

0.1 Overview

0.1.1 Simulations

We assess the power and calibration of PIPs, as well as coverage of 95% credible sets, for various SERs and SuSiEs.

We simulate n=500 samples and p=100 features. We simulate a binary design matrix X, to simulate the gene set enrichment scenario. Each column of X has 20% of it’s entries set to 1. We control the correlation between adjacent columns of X by copying the previous column and randomly flipping a small proportion of 1’s to 0’s (p_flip), and a corresponding number of 0’s to 1’s. By setting p_flip to a small value, we can generate highly correlated X, conversely setting it to large value quickly decouples the features of X. We generate X with p_flip = 0.02, 0.1, 0.5 which we call X_bin_strong, X_bin_medium and X_bin_weak respectively, to explore the performance of our variable selection methods across a range of correlation strenghts.

We generate binary observatons y under a simple model. We select \(L\) columns of X to be enriched. If \(x_{ij} = 1\) for at least 1 of the enriched columns, \(y_i = 1\) with probability \(p_{active}\), otherwise \(y_i =1\) with probability \(p_{background}\). When \(L=1\) this corresponds with an additive model on the log-odds scale.

We simulate under background_logit = -10, -4, -2. and active_logit = -1, -0.5, 0, selecting L=1, 2, 3 enriched features. We repeat each simulation setting for 20 iterations.

0.1.2 Methods Assessed

We assess the performance of the linear SER and several approximations to the logistic SER. We also assess the performance of several SuSiE variants. As a baseline we run linear SuSiE. We fit logistic SuSiE with the global variational approximation, IBSS, and IBSS2M.

0.2 SERs

0.2.1 PIP Calibration

We report the calibration of posterior inclusion probabilities (PIPs). The PIP summarizes the marginaly probability that each feature is included in the model. We bin features across all simulations by there PIPs, and assess the frequency that features in each PIP bin are actually enriched. If thre observed frequency closely matches the average value of PIPs in each bin, we say that the PIPs are well calibrated.

Overall the Univariate VB SER shows the best calibration, even being slightly conservative at times. The VB SER may have slightly anti-conservative PIPs, which is consistent with our expectation that the variational approximation will tend to favor whichever feature has the strongest evidence of association.

0.2.1.1 SERs, \(L=1\)

pips %>%
  filter_ser() %>%
  filter(L==1) %>%
  make_pip_calibration_plot() +
  facet_wrap(vars(fit_method))
## Warning: Removed 5 rows containing missing values (geom_pointrange).

0.2.1.2 SERs, \(L > 1\)

pips %>%
  filter_ser() %>%
  filter(L>1) %>%
  make_pip_calibration_plot() +
  facet_wrap(vars(fit_method))
## Warning: Removed 5 rows containing missing values (geom_pointrange).

0.2.2 FDP vs Power

In the single effect simulations we see that the Univariate VB SER has greater power than the VB SER, and comparable performance to the linear SER. Since among equally powered approaches we prefer the better calibrated one, the UVB SER shows the best performance here.

When there are multiple effects we see that the performance of the linear model takes a hit. The VB SER does a slightly better job ordering PIPs close to 1, but overall performance is comparable.

0.2.2.1 \(L=1\)

pips %>%
  filter_ser() %>%
  filter(L==1) %>%
  make_fdp_plot()

0.2.2.2 \(L>1\)

pips %>%
  filter_ser() %>%
  filter(L>1) %>%
  make_fdp_plot()

0.2.3 95% credible set coverage

As the correlation structure in the design matrix increases, and as the number of enriched features increases we see a loss of coverage in all models. The credible sets from the UVB SER tend to have better coverage than the other methods. We note that the loss of coverage in the VB model as correlation structure increases is expected. Loss of coverage when \(L>1\), and in the linear model, may be attributable to model mis-specification.

0.2.3.1 All credible sets

cs %>%
  filter_ser() %>%
  make_cs_coverage_plot() +
  facet_wrap(vars(X_name))

0.2.3.1.1 \(L=1\)
cs %>%
  filter_ser() %>%
  filter(L[1]==1) %>%
  make_cs_coverage_plot() +
  facet_wrap(vars(X_name))

0.2.3.1.2 \(L>1\)
cs %>%
  filter_ser() %>%
  filter(L[1]>1) %>%
  make_cs_coverage_plot() +
  facet_wrap(vars(X_name))

0.2.3.2 Credible sets with \(\leq 25\) feeatures

cs %>%
  filter_ser() %>%
  make_cs_coverage_plot(max_cs_size = 25) +
  facet_wrap(vars(X_name))

0.2.3.2.1 \(L=1\)
cs %>%
  filter_ser() %>%
  filter(L[1]==1) %>%
  make_cs_coverage_plot(max_cs_size=25) +
  facet_wrap(vars(X_name))

0.2.3.2.2 \(L>1\)
cs %>%
  filter_ser() %>%
  filter(L[1]>1) %>%
  make_cs_coverage_plot(max_cs_size=25) +
  facet_wrap(vars(X_name))

0.2.3.3 CS Size Plots

cs %>% 
  filter_ser() %>%
  make_coverage_by_cs_size()

cs %>% 
  filter_ser() %>%
  make_cs_size_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cs %>% 
  filter_ser() %>%
  make_pairwise_cs_size_plot()
## `summarise()` has grouped output by 'fit_method', 'hash'. You can override
## using the `.groups` argument.
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Warning: Removed 255 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 286 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 688 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 371 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 712 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 286 rows containing non-finite values (stat_bin2d).
## Warning: Removed 230 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 662 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 339 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 692 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 688 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 662 rows containing non-finite values (stat_bin2d).
## Warning: Removed 490 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 491 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 616 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 371 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 339 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 491 rows containing non-finite values (stat_bin2d).
## Warning: Removed 131 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 537 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 712 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 692 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 616 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 537 rows containing non-finite values (stat_bin2d).
## Warning: Removed 534 rows containing non-finite values (stat_density).

0.2.3.3.1 \(L=1\)
cs %>%
  filter_ser() %>%
  filter(L[1] == 1) %>%
  make_coverage_by_cs_size()

cs %>%
  filter_ser() %>%
  filter(L[1] == 1) %>%
  make_cs_size_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.2.3.3.2 \(L>1\)
cs %>%
  filter_ser() %>%
  filter(L[1] > 1) %>%
  make_coverage_by_cs_size()

cs %>%
  filter_ser() %>%
  filter(L[1] > 1) %>%
  make_cs_size_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.3 SuSiE

0.3.1 PIP Calibration

Calibration of PIPs in the single-effect simulations

0.3.1.1 \(L=1\)

pips %>%
  filter_susie() %>%
  filter(L==1) %>%
  make_pip_calibration_plot() +
  facet_wrap(vars(fit_method))

0.3.1.2 \(L > 1\)

pips %>%
  filter_susie() %>%
  filter(L>1) %>%
  make_pip_calibration_plot() +
  facet_wrap(vars(fit_method))

0.3.2 FDP vs Power

0.3.2.1 \(L=1\)

pips %>%
  filter_susie() %>%
  filter(L==1) %>%
  make_fdp_plot()

0.3.2.2 \(L > 1\)

pips %>%
  filter_susie() %>%
  filter(L>1) %>%
  make_fdp_plot()

0.3.3 95% credible set coverage

IBSS2M is the clear winner here. We have good coverage across all X and for multiple enriched features. We note that the suprisingly low coverage of the other methods when assessing all credible sets is a bit misleading– in practice we fit SuSiE with more components that needed, and this leads to a fit model with uninformative components. What we are seeing here is that when SuSiE correctly identifies a feature in one component, the uninformative components are less likely to include that enriched feature. Even crude filtering of the credible sets remedies this. For example for X_bin_weak we see that all “small” credible sets (ie probably not a null component) have good empirical coverage.

0.3.3.1 All credible sets

cs %>%
  filter_susie() %>%
  make_cs_coverage_plot() +
  facet_wrap(vars(X_name))

0.3.3.1.1 \(L=1\)
cs %>%
  filter_susie() %>%
  filter(L[1]==1) %>%
  make_cs_coverage_plot() +
  facet_wrap(vars(X_name))

0.3.3.1.2 \(L > 1\)
cs %>%
  filter_susie() %>%
  filter(L[1] > 1) %>%
  make_cs_coverage_plot() +
  facet_wrap(vars(X_name))

0.3.3.2 Credible sets with \(\leq 25\) features

cs %>%
  filter_susie() %>%
  make_cs_coverage_plot(max_cs_size = 25) +
  facet_wrap(vars(X_name))

0.3.3.2.1 \(L=1\)
cs %>%
  filter_susie() %>%
  filter(L[1] == 1) %>%
  make_cs_coverage_plot(max_cs_size = 25) +
  facet_wrap(vars(X_name))

0.3.3.2.2 $ L > 1$
cs %>%
  filter_susie() %>%
  filter(L[1] > 1) %>%
  make_cs_coverage_plot(max_cs_size = 25) +
  facet_wrap(vars(X_name))

0.3.3.3 CS Size Plots

cs %>% 
  filter_susie() %>%
  make_coverage_by_cs_size()

cs %>% 
  filter_susie() %>%
  make_cs_size_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cs %>% 
  filter_susie() %>%
  make_pairwise_cs_size_plot()
## `summarise()` has grouped output by 'fit_method', 'hash'. You can override
## using the `.groups` argument.
## Warning: Removed 16 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 88 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 19 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 20 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 18 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 22 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 19 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 50 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 88 rows containing non-finite values (stat_bin2d).
## Warning: Removed 79 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 85 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 86 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 88 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 90 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 88 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 109 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 19 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 85 rows containing non-finite values (stat_bin2d).
## Warning: Removed 13 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 14 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 18 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 22 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 16 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 46 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 20 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 86 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 14 rows containing non-finite values (stat_bin2d).
## Warning: Removed 14 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 19 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 23 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 17 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 47 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 18 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 88 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 18 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 19 rows containing non-finite values (stat_bin2d).
## Warning: Removed 15 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 23 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 17 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 49 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 22 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 90 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 22 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 23 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 23 rows containing non-finite values (stat_bin2d).
## Warning: Removed 20 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 23 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 53 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 19 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 88 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 16 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 17 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 17 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 23 rows containing non-finite values (stat_bin2d).
## Warning: Removed 14 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 48 rows containing missing values
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 50 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 109 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 46 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 47 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 49 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 53 rows containing non-finite values (stat_bin2d).
## Warning: Ignoring unknown parameters: yintercept
## Warning: Removed 48 rows containing non-finite values (stat_bin2d).
## Warning: Removed 34 rows containing non-finite values (stat_density).

0.3.3.3.1 \(L = 1\)
cs %>% 
  filter_susie() %>%
  filter(L[1] == 1) %>%
  make_coverage_by_cs_size()

cs %>% 
  filter_susie() %>%
  filter(L[1] == 1) %>%
  make_cs_size_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.3.3.3.2 \(L > 1\)
cs %>% 
  filter_susie() %>%
  filter(L[1] > 1) %>%
  make_coverage_by_cs_size()

cs %>% 
  filter_susie() %>%
  filter(L[1] > 1) %>%
  make_cs_size_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.